
Upstream merge #289

Merged 27 commits into main from upstream-0.2.115-intermediate on May 5, 2023

Conversation

@lgruen lgruen commented May 5, 2023

Intermediate upstream merge between 0.2.115 and 0.2.116 to pick up hail-is#12929, which should fix our unit test failures (see https://centrepopgen.slack.com/archives/C03FZL2EF24/p1683251418792629?thread_ts=1683251237.683889&cid=C03FZL2EF24).

Please merge, don't squash.

danking and others added 27 commits April 26, 2023 15:49
…form project to track true current state (hail-is#12882)

I fixed some issues in `gcp`, but there are enough difficult-to-resolve
issues that I want to start a fresh terraform project to track reality,
as distinct from what we recommend a new user do.

The fresh attempt project is `gcp-broad`. I plan to slowly expand it
until it captures everything. From this point forward, for the subset of
infrastructure in `gcp-broad`, we should not make manual changes.
Instead, we should propose changes as PRs, review them, merge them, and
apply the terraform by hand once the changes are in `main`.
I should have checked quay.io for 1.12.0, which doesn't exist there
(https://quay.io/repository/skopeo/stable?tab=tags&tag=v1.12.0),
even though 1.12.0 was released two weeks ago:
https://github.com/containers/skopeo/releases
`credentials_host_file_path` is unused in `main`. Deleting that allowed
me to then delete `credentials_host_dirname`, after which
`CloudUserCredentials.file_name` and `CloudUserCredentials.secret_name`
were only used for `CloudUserCredentials.mount_path`, so I merged those.
`CloudUserCredentials.secret_data` and
`CloudUserCredentials.hail_env_name` were unused.
NB: this is a stacked PR. To see just these changes, see [this
commit](hail-is@ae51e0a)

---

[VPC Flow Logs](https://cloud.google.com/vpc/docs/flow-logs):

> VPC Flow Logs records a sample of network flows sent from and received by VM
> instances, including instances used as Google Kubernetes Engine nodes. These
> logs can be used for network monitoring, forensics, real-time security
> analysis, and expense optimization.

I found the collection process the most illuminating part of the
documentation. My summary of that process follows:

1. Packets are sampled on the network interface of a VM. Google claims
an average sampling rate of 1/30. The rate decreases if the VM is under
load and is not configurable by us.

2. Within an "aggregation interval", packets are aggregated into
"records" which are keyed (my term)
by source & destination. There are currently six choices for aggregation
interval: 5s, 30s, 1m,
   5m, 10m, and 15m.

3. Records are sampled. The sampling rate is a user-configured
floating-point number (precision unclear) between 0 and 1.

4. Metadata is optionally added to the records. The metadata captures
information about the source
and destination VM such as project id, VM name, zone, region, GKE pod,
GKE service, and geographic
information of external parties. The user may elect to receive all
metadata, no metadata, or a
   specific set of metadata fields.

5. The records are written to Google Cloud Logging.

The pricing of VPC Flow Logs is described at the [network pricing
page](https://cloud.google.com/vpc/network-pricing#network-telemetry).
Notice that, if logs are only sent to Cloud Logging (not to BigQuery,
Pub/Sub, or Cloud Storage):

> If you store your logs in Cloud Logging, logs generation charges are waived,
> and only Logging charges apply.

I believe that in this phrase, "logs generation charges" refers to *VPC
Flow Logs* generation charges. The Google Cloud Logging [pricing
page](https://cloud.google.com/stackdriver/pricing#google-clouds-operations-suite-pricing)
indicates that, after 50 GiB of free logs, the user is charged 0.50 USD
per GiB of logs. Storage is free for thirty days and 0.01 USD per GiB
for each additional day.
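
As a concrete helper, the pricing above amounts to something like the following
sketch. The function name and shape are mine; the constants (50 GiB free,
0.50 USD per GiB, 30 free retention days, 0.01 USD per GiB per additional day)
come from the pricing pages.

```python3
def monthly_logging_cost_usd(gib_ingested: float, retention_days: int = 30) -> float:
    """Approximate Cloud Logging cost for one month, per the pricing page above."""
    ingestion = max(0.0, gib_ingested - 50) * 0.50  # first 50 GiB free, then 0.50 USD/GiB
    extra_days = max(0, retention_days - 30)        # storage is free for thirty days
    storage = gib_ingested * 0.01 * extra_days      # then 0.01 USD per GiB per additional day
    return ingestion + storage

print(monthly_logging_cost_usd(169))  # ~59.5 USD, roughly the 10,000-VM estimate computed below
```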

We can calculate the cost of our logs as follows. Refer to the
[definition of the record
format](https://cloud.google.com/vpc/docs/flow-logs#record_format) for
details.

```python3
# Estimate the size of one flow-log record (see the record format linked above).
ip_string = len("123.123.123.123")
# The connection 5-tuple: two IP address strings plus three small integer fields
# (ports and protocol).
ip_connection = 4 + ip_string + ip_string + 4 + 4
date_time = len("1937-01-01T12:00:27.87+00:20")
record_bytes = sum((
    ip_connection,
    max(len('SRC'), len('DEST')),  # the reporter field
    8,                             # three 8-byte numeric fields
    8,
    8,
    date_time,                     # interval start time
    date_time,                     # interval end time
))
assert record_bytes == 126

hours_per_month = 24 * 30  # 24 hours a day for a 30-day month
seconds_per_hour = 60 * 60

# Parameters: 15-minute aggregation interval, 10,000 VMs, 50% record sampling,
# and an assumed 100 unique connections per VM per aggregation interval.
seconds_per_interval = 15 * 60
vms = 10000
sampling_rate = 0.5
connections_per_vm_per_aggregation_interval = 100

intervals_per_hour = seconds_per_hour / seconds_per_interval
records_per_hour = intervals_per_hour * vms * connections_per_vm_per_aggregation_interval * sampling_rate
bytes_per_hour = records_per_hour * record_bytes
bytes_per_month = bytes_per_hour * hours_per_month
GiB_per_month = bytes_per_month / 1024. / 1024 / 1024

# First 50 GiB per month are free; 0.50 USD per GiB thereafter.
USD_per_month = max(0, GiB_per_month - 50) * 0.5

print(GiB_per_month)
print(USD_per_month)
```

This works out to roughly 60 USD to run a 10,000 VM cluster 24 hours a day for
30 days.

I suspect our average VM count in a month is closer to 10, which is well
within the free tier (roughly 170 MiB). I might be wrong about the
connections per VM per aggregation interval, but this is straightforward to
monitor once we have the logs.

For a sense of the cost landscape, these are all free:

1. 1000 VMs.
2. 500 VMs, with a sampling rate of 1.
3. 200 VMs, with a sampling rate of 1, with an interval of 5 minutes.
4. 10 VMs, with a sampling rate of 1, with an interval of 30 seconds.
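
As a rough check on those scenarios, here is the same arithmetic from the block
above wrapped in a function (a sketch; the parameter names are mine):

```python3
def gib_per_month(vms, sampling_rate, interval_seconds,
                  connections_per_vm_per_interval=100, record_bytes=126,
                  hours_per_month=24 * 30):
    intervals_per_hour = 3600 / interval_seconds
    records_per_hour = intervals_per_hour * vms * connections_per_vm_per_interval * sampling_rate
    return records_per_hour * record_bytes * hours_per_month / 1024 ** 3

# All four scenarios land under the 50 GiB free allotment:
print(gib_per_month(1000, 0.5, 15 * 60))  # ~17 GiB
print(gib_per_month(500, 1.0, 15 * 60))   # ~17 GiB
print(gib_per_month(200, 1.0, 5 * 60))    # ~20 GiB
print(gib_per_month(10, 1.0, 30))         # ~10 GiB
```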

It's all linear, so if we need to halve the interval we can either
change the sampling rate, reassess our expected number of VM-hours, or
adjust the service fee accordingly.

We can also assess the landscape of fees necessary to cover costs
(ignoring the free 50 GiB):

1. 15 minute intervals, 0.5 sampling rate, 100 expected connections per
   VM per interval: 0.0000008 USD per core per hour.

2. 30 second intervals, 1.0 sampling rate, 100 expected connections per
   VM per interval: 0.00005 USD per core per hour.

3. 5 second intervals, 1.0 sampling rate, 100 expected connections per
   VM per interval: 0.0003 USD per core per hour.

4. 5 second intervals, 1.0 sampling rate, 1000 expected connections per
   VM per interval (1000 unique connections per 5-second interval honestly
   seems to me quite remarkable performance): 0.003 USD per core per hour.

```python3
# Reuses bytes_per_hour and vms from the block above; assumes 16 cores per VM
# and 0.50 USD per GiB of logs.
USD_per_core_per_hour = bytes_per_hour / vms / 1024. / 1024 / 1024 * 0.5 / 16

print(USD_per_core_per_hour)
```

---

# Conclusion

I think we're safe to enable this with the parameters in this PR
(15-minute intervals, 50% sampling). We can assess unknown parameters, like
connections per VM, and get comfortable looking at these logs.

Security constraints or observability demands may push us towards
desiring more logs. If that
occurs, we can assess the need for a new fee. Regardless, this fee
appears to be small relative to
the current cost of preemptible cores.
…l-is#12944)

We always ensure that `cleanup` is run, but to guarantee that
`post_job_complete` also runs we need to get through any computation in
`cleanup` and `mark_complete` that could potentially hang. So we add timeouts
to any yield points in those functions and catch exceptions broadly, so that
we always continue through the cleanup/`mark_complete` process regardless of
failure.
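
The shape of that pattern, as a minimal sketch (the step names and the
five-minute timeout here are illustrative, not the actual Batch code):

```python3
import asyncio
import logging

log = logging.getLogger(__name__)

async def run_cleanup_steps(steps):
    # Bound every await so a hung step cannot block the rest of cleanup, and catch
    # broadly so one failing step does not prevent the remaining steps from running.
    for step in steps:
        try:
            await asyncio.wait_for(step(), timeout=300)
        except Exception:
            log.exception(f'cleanup step {step.__name__} failed; continuing')
```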
There was an extra token-matching check in the WHERE clause for the jobs
billing table, which is the only billing table not parameterized with a
token.
CHANGELOG: `hl.Struct` now has a correct and useful implementation of
pprint.

For structs with keys that were not identifiers, we produced incorrect
`repr` output. For `pprint`, we just `pprint`'ed a dictionary (so you
cannot even tell that the object was an `hl.Struct`). This PR:

1. Fixes `hl.Struct.__str__` to use the kwargs or dictionary
representation based on whether the keys are Python identifiers
(sketched below).
2. Teaches `StructPrettyPrinter` to first try to `repr` the struct (this
is what the default pretty printer does).
3. Teaches `StructPrettyPrinter` to properly pretty print a struct as an
`hl.Struct`, preferring the kwarg representation when appropriate.
4. Teaches `_same` to use pretty printing when showing differing
records.
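
The identifier-based choice in point 1 amounts to roughly the following
(a self-contained sketch of the logic, not the actual Hail implementation):

```python3
def struct_repr(fields: dict) -> str:
    # Keys that are valid Python identifiers can be rendered as keyword arguments;
    # any other key forces the dictionary form.
    if all(isinstance(k, str) and k.isidentifier() for k in fields):
        args = ', '.join(f'{k}={v!r}' for k, v in fields.items())
        return f'Struct({args})'
    return f'Struct(**{fields!r})'

print(struct_repr({'x': 5, 'y': 'foo'}))   # Struct(x=5, y='foo')
print(struct_repr({'hello world': 5}))     # Struct(**{'hello world': 5})
```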
As a mitigation, this turns off cloudfuse until we understand why the
unmount is not working properly.
Spark 3.3.0 uses log4j2. Note the "2". If you use the log4j1
programmatic reconfiguration system, you will break log4j2 for you and
everyone else. The only way to recover from such a breakage is to use
the log4j2 programmatic reconfiguration system.

Changes in this PR:

1. Include JVM output in error logs when the JVM crashes. This should
help debugging of JVM crashing in production until the JVM logs are
shown on a per-worker page.

2. JVMEntryway is now a real gradle project. I need to compile against
log4j, and I didn't want to do that by hand with `javac`. Ignore
gradlew, gradlew.bat, and gradle/wrapper; they're programmatically
generated by gradle.

3. Add logging to JVMEntryway. JVMEntryway now logs its arguments into
the QoB job log. I also log exceptions from the main thread or the
cancel thread into the job log. We also flush the logs after the main
thread completes, the cancel thread completes, and when the try-catch
exits. This should ensure that regardless of what goes wrong (even if
both threads fail to start) we at least see the arguments that the
JVMEntryway received.

4. Use log4j2 programmatic reconfiguration after every job. This
restores log4j2 to well enough working order that, *if you do not try to
reconfigure it using log4j1 programmatic configuration*, logs will work.
All old versions of Hail use log4j1 programmatic configuration. As a
result, **all old versions of Hail will still have no logs**. However,
new versions of Hail will log correctly even if an old version of Hail
used the JVM before it.

5. `QoBAppender`. This is how we always should have done logging. A
custom appender which we can flush and then redirect to a new file at
our whim. I followed the log4j2 best practices for creating a new
appender. All these annotations, factory methods, and managers are The
Right Way, for better or worse.

If we ever ban old versions of Hail from the cluster, then we can also
eliminate the log4j2 reconfiguration. New versions of Hail work fine
without any runtime log configuration (thanks to `QoBAppender`).

I would like to eliminate reconfiguration because log4j2 reconfiguration
leaves around orphaned appenders and appender managers. Maybe I'm
implementing the Appender or Appender Manager interfaces wrong, but I've
read over that code a bunch of times and I cannot sort out what I am
missing.
See hail-is#12958.

---------

Co-authored-by: Daniel Goldstein <danielgold95@gmail.com>
…es to req generation (hail-is#12929)

CHANGELOG: Hail no longer officially supports Python 3.7.

Combines hail-is#12927 and
hail-is#12908.

Many changes. All seem to be necessary together.

---

I fixed any new mypy errors and deprecation warnings. I also cleaned up
plots.py (which isn't checked by mypy in CI) because I was in there and it
was a mess. I unified all our pip-tools versions. We require pandas 2 now,
which requires Bokeh 3.x.x. I fixed the pinned-requirements.txt dependencies
so they reflect the versions that actually have to work together at runtime.
I upgraded Sphinx; the method names lose their fixed-width styling, but I
think it looks fine. A version policy was added. Updating everything means
Jinja2 can jump to the latest version too. NumPy deprecated some of its type
aliases; using the trailing-underscore (`_`) names gets rid of the warning.
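
For example, with the deprecated NumPy aliases (a minimal illustration;
`np.bool_` is the supported trailing-underscore spelling):

```python3
import numpy as np

# Deprecated alias (warns, and removed in newer NumPy releases):
#   mask = np.zeros(3, dtype=np.bool)
# Supported spelling with the trailing underscore:
mask = np.zeros(3, dtype=np.bool_)
```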

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
The old bucket did not use uniform access control and also was
multi-regional (us). I created a new bucket using the random suffix
ger0g which has uniform access control. I also switched the location to
us-central1 (not pictured here because that is a variable).

I copied all the JARs from `gs://hail-query/jars` to
`gs://hail-query-ger0g/jars` using a GCE VM.

Again, global-config is not present in our terraform, so I'll have to
manually edit that to reflect this new location:
`gs://hail-query-ger0g`.

The deployment process is:

1. Edit global-config to reflect new bucket.
2. Delete batch and batch-driver pods.
3. Delete old workers.

The rollback process (if necessary) is the same.

Since this requires wiping the workers, I'll wait for a time when no one
is on the cluster to do it.

Any users using explicit JAR URLs will need to switch to
`gs://hail-query-ger0g/...`.
…12960)

Applies the most restrictive bind and event propagation settings to job
container mounts. While user jobs do not have the capability to create
mount points, overlapping mount points in the container config can
inadvertently trigger mount propagation back to the host, which we never
want.
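
In OCI terms, the most restrictive settings look something like this (an
illustrative mount entry, not the exact Batch container config; `rprivate`
prevents mount events from propagating back to the host):

```python3
# Hypothetical job mount entry in OCI-config style, shown as a Python dict.
job_mount = {
    'destination': '/io',
    'type': 'none',
    'source': '/host/batch/io',  # illustrative host path
    'options': [
        'rbind',      # bind mount
        'rprivate',   # most restrictive propagation: events never reach the host
        'nosuid',
        'nodev',
    ],
}
```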
I updated terraform, but:

1. GCP Terraform state is still local on my laptop.

2. GCP Terraform appears to not configure global-config. As such, I
cannot thread the name of the bucket through to the tests the way we do
with TEST_STORAGE_URI. For now, I've hardcoded the name (which is what
we were doing previously). When we eventually get to testing recreation
of GCP in a new project we'll have to address the global config then.
hail-is@1940547
should have incremented the worker version. New QoB logging is
incompatible with old workers.
This PR just adds a test to make sure submounts don't cause deletion.
Because the Python version in hail-ubuntu changed from 3.7 to 3.8, both
of the `hail-pip-installed` dockerfiles were running 3.8. I changed the
old 3.7 dockerfile to try to use 3.9.
…-is#12979)

Skipping these for the time being because they currently only pass for
16-core worker VMs and not for 8-core worker VMs.
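
Something along these lines (the test name and reason string are illustrative):

```python3
import pytest

@pytest.mark.skip(reason='currently only passes on 16-core worker VMs; see hail-is#12979')
def test_example_memory_bound():
    ...
```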
@lgruen lgruen requested a review from illusional May 5, 2023 02:00
@lgruen lgruen merged commit fc212f5 into main May 5, 2023
@lgruen lgruen deleted the upstream-0.2.115-intermediate branch May 5, 2023 02:01
@lgruen lgruen mentioned this pull request May 5, 2023